The Rookie Report: How Rookie Performance and Background Shape NBA Careers

Motivation

Every year, NBA teams invest significant amounts of money in drafting rookies, often relying on past performance metrics, scouting reports, or the reputation of a player's college. But how reliable are these indicators of long-term success? By examining whether rookie statistics, draft position, and background can predict career outcomes, we aim to provide reliable tools for predicting future success that teams can use to evaluate players before draft night and during a player’s rookie season.

This research can provide valuable information to several key stakeholders. For team owners and investors, the use of data-driven methods to forecast a rookie’s career potential can improve decision-making and reduce financial risk during the draft selection process. Additionally, understanding the impact of factors, such as draft position and player origin, may enable franchises to identify undervalued talent and make strategic roster decisions. General managers, the front office, and team executives can also use our research to analyze a player’s rookie statistics and determine whether they will be worth keeping on the team.

The information generated from this project could also be useful to aspiring professional athletes. If certain college programs are shown to have stronger correlations with NBA success, it may influence a player's decision on where to play for their collegiate career, ultimately shaping their development and future opportunities.

In this report, we aim to answer the following research questions through data analysis and visualizations:

  1. Do rookie stats predict long-term success?
  2. Is there a difference in performance trajectory between lottery picks and second-round picks?
  3. Which college or international leagues produce the most statistically successful NBA rookies?

Summary of Findings

For our research question "Do rookie stats predict long-term success?", we defined long-term success as the number of All-Star selections received by each player. Using a linear regression model, our analysis determined that rookie statistics have some predictive power. Defensive stats, particularly steals and blocks, had the largest positive coefficients in the model for future All-Star selections. However, the model had a high mean squared error, indicating substantial variability and low predictive accuracy. This suggests that rookie stats alone are insufficient to reliably predict long-term success.

Our analysis for the research question "Is there a difference in performance trajectory between lottery picks and second-round picks?" shows that lottery picks tend to have higher performance scores than second-round picks, suggesting that draft position is associated with long-term success. The lottery group's summary statistics show a higher median and more variance, which may indicate more room for growth and development. Although exceptional outliers exist among second-round players, an earlier draft position is correlated with an increased likelihood of long-term NBA success.

For our research question "Which college or international leagues produce the most statistically successful NBA players?", we determined that Duke University produced the most players who achieved the highest performance statistic in at least one of six categories: points per game, minutes per game, field goal percentage, 3-point percentage, free throw percentage, and career win shares per 48 minutes adjusted for career length. Among all colleges, Duke players also accounted for the most instances of achieving the highest performance statistic in one of the six metrics in a given year. We counted multiple achievements in one year, or achievements in multiple years, as separate instances. Finally, Duke produced the most players with two or more highest performance statistics in a given year.

Installing Plotly

As a part of our challenge goal, we decided to use the plotly library in our report to generate interactive visualizations. Here, we install the library so that we can use it in the report. Re-running the cell is harmless: if plotly is already installed, pip reports that the requirement is already satisfied.

In [1]:
!pip install plotly
Requirement already satisfied: plotly in /opt/conda/lib/python3.11/site-packages (6.1.2)
Requirement already satisfied: narwhals>=1.15.1 in /opt/conda/lib/python3.11/site-packages (from plotly) (1.42.0)
Requirement already satisfied: packaging in /opt/conda/lib/python3.11/site-packages (from plotly) (24.2)

[notice] A new release of pip is available: 25.0.1 -> 25.1.1
[notice] To update, run: pip install --upgrade pip

Imports

Here, we run all the necessary imports for our report to generate data frames, visualizations, linear regressions, and run tests.

In [2]:
!pip install -q pytest ipytest
import ipytest
ipytest.autoconfig()
import matplotlib.pyplot as plt
import pandas as pd
import io
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import plotly.express as px
import plotly.io as pio
import doctest
import numpy as np
from pandas.testing import assert_frame_equal
import plotly.graph_objects as go
pio.renderers.default = "notebook"

sns.set_theme()

Challenge Goals

  1. Multiple Datasets: For our analysis, we utilized four different datasets that contained information about rookie statistics, draft data, lottery data, and All-star data. Two of our questions (Do rookie stats predict long-term success?, Is there a difference in performance trajectory between lottery picks and second-round picks?) involved the analysis of two datasets. The first question analyzed the rookie_stats and all_star data sets, so we performed a merge of those data frames. The second question analyzed lottery and draft_data, so we performed a merge on those data frames. Both merge operations were conducted using a merge function that we created.

  2. New Library: For our challenge goal of using a new library, we used plotly to provide a clearer representation of the share of colleges that produced rookies with the highest performance statistics. We first used a pie chart to compare each college's share of high-performing rookies from 1995-2019. Then, we used a bar chart to display the players who were high-performing players two or more times. Lastly, we used a bubble chart to provide a scalar comparison of each college's contribution to high-performing lottery picks in the NBA. This challenge goal was expanded to include a bubble chart because we wanted to create different interactive visualizations of each college's contribution to the share of high-performing rookies. Users can click on each section or bubble in the pie and bubble charts to learn the number of instances of high-performing rookies a school produced.

Collaboration and Conduct

Students are expected to follow Washington state law on the Student Conduct Code for the University of Washington. In this course, students must:

  • Indicate on your submission any assistance received, including materials distributed in this course.
  • Not receive, generate, or otherwise acquire any substantial portion or walkthrough to an assessment.
  • Not aid, assist, attempt, or tolerate prohibited academic conduct in others.

Update the following code cell to include your name and list your sources. If you used any kind of computer technology to help prepare your assessment submission, include the queries and/or prompts. Submitted work that is not consistent with sources may be subject to the student conduct process.

In [3]:
your_name = "Shayna Suzuki, Emily Trinh, Javin Choi"
sources = [
    "Learning Algorithms Lecture", 
    "Data Visualization Lecture",
    "https://plotly.com/python/plotly-fundamentals/",
    "https://plotly.com/python/getting-started/",
    "https://plotly.com/python/renderers/",
    "https://seaborn.pydata.org/generated/seaborn.regplot.html",
    "https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots_adjust.html",
    "https://htmlcolorcodes.com/",
    "Objects Lecture",
    "https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.testing.assert_frame_equal.html",
    "https://plotly.com/python/figure-labels/", 
    "Search Homework",
    "https://plotly.com/python/table/#cell-color-based-on-variable",
    "https://blackinblue.trinity.duke.edu/haves-and-have-nots-inequities-amongst-duke-universitys-sports-teams-0",
    "https://goduke.com/sports/mens-basketball/roster/coaches/mike-krzyzewski/4159",
    "Amath 301 Lecture on np arrays Winter 2025",
    "https://www.w3schools.com/python/pandas/ref_df_nunique.asp"
]

assert your_name != "", "your_name cannot be empty"
assert ... not in sources, "sources should not include the placeholder ellipsis"
assert len(sources) >= 6, "must include at least 6 sources, inclusive of lectures and sections"

Data Setting and Methods

Loading Datasets:

Here, we load the following datasets that will be used to answer our research questions through data analysis:

  1. NBA Draft Basketball Data 1989-2021: This dataset (CSV file) contains the NBA Draft picks from 1989-2021. It has information about the year drafted, draft rank, NBA overall draft pick, team that drafted the player, college the player attended, years active in the NBA, number of games played in the NBA, and minutes played in the NBA.

  2. NBA Lottery Picks from 1995-2020: This dataset (CSV file) contains the NBA Draft picks from 1995-2020 and information about players’ careers. It contains the players’ draft years, pick number in the draft, team that drafted the player, college the player attended, number of seasons played in the NBA, career games played, career minutes played, and career points scored.

  3. NBA Rookies Performance Statistics and Minutes: The dataset (CSV file) contains performance statistics for NBA rookies from 1980-2016. It includes the rookie’s draft year, number of games played, number of minutes played, number of points scored, number of field goals made, number of field goals attempted, field goal percentages, and number of three pointers made.

  4. NBA All-Stars: The dataset (CSV file) contains information about players who were selected to participate in NBA All-Star games. It includes information about the number of All-Star games players were selected for, their Hall of Fame status, and their position.

All data used in this report was obtained from Kaggle, an online platform where users share publicly available datasets. The datasets used in our analysis were compiled and uploaded by Kaggle contributors, who may have sourced the data from official NBA statistics sites, sports analytics platforms, or manual curation. While Kaggle is a convenient and widely used resource, the accuracy and completeness of the data depend on the original uploader, and inconsistencies or omissions may exist.

In [4]:
def load_data(rookie_stats, lottery, all_star, draft):
    """
    Takes in four file paths (rookie stats, lottery picks, All-Star data, and draft data)
    and returns four data frames in this order: All-Star data, rookie stats, lottery picks,
    and draft data. Rows with missing values are dropped from every data frame except
    the lottery picks.
    """
    all_star = pd.read_csv(all_star).dropna()
    rookie_stats = pd.read_csv(rookie_stats).dropna()
    lottery = pd.read_csv(lottery)
    draft_data = pd.read_csv(draft).dropna()
    return all_star, rookie_stats, lottery, draft_data


# Load the four datasets. Arguments follow the parameter order (rookie stats, lottery,
# All-Star, draft); results are unpacked in the return order.
all_star, rookie_stats, lottery, draft_data = load_data("NBA Rookies by Year.csv", 
                                                        "bbref-scraped-lotteryData.csv", 
                                                        "NBA Hall of Famers 2021.csv", 
                                                        "nbaplayersdraft.csv")

Testing Datasets:

We create five small data frames, populated with data from the larger datasets, which we use to test our methods and confirm that they function properly.

In [5]:
nba_data_test1 = pd.read_csv(io.StringIO("""index,Name,Year Drafted,GP,MIN,PTS,FGM,FGA,FG%,3P Made,STL,BLK,TOV,EFF,position,All_star_selections,In_Hall_of_fame,height,weight,born
0,Brandon Ingram,2016,36,27.4,7.4,2.6,7.6,34.7,0.5,0.4,0.4,1.3,7.3,F,1,2,206,86,1997
1,Joel Embiid,2016,22,24.8,18.9,6.2,13.4,46.4,1.2,0.8,2.4,3.8,18.8,C,4,2,213,113,1994
2,Jaylen Brown,2016,34,13.4,4.9,1.9,4.4,44.6,0.4,0.3,0.3,0.6,4.6,F,1,2,193,97,1992
"""))

nba_data_test2 = pd.read_csv(io.StringIO("""index,Name,College,YearsActive,mp 
0,Brandon Ingram,Duke,9,16000
1,Joel Embiid,University of Kansas,11,14500
2,Jaylen Brown,UC Berkeley, 9,15000
3,Cherokee Parks,Duke, 9,7459
4,William Avery,Duke,11,1205
"""))

nba_test_lottery_performance = pd.read_csv(io.StringIO(
"""player,Year,college_name,pts_per_g,mp_per_g,fg_pct,fg3_pct,ft_pct,ws_per_48
Brandon Ingram,2008,Duke,12.1,10,0.5,0.24,0.8,0.31
Joel Embiid,2008,University of Kansas,15.5,9.8,0.47,0.19,0.65,0.43
Cherokee Parks,2008,Duke,13,7.7,0.66,0.3,0.52,0.6
Hasheem Thabeet,2009,UConn,12.1,10,0.5,,0.78,0.32
Jaylen Brown,2010,UC Berkeley,13.2,8.7,0.51,0.25,0.67,0.09
Kwame Brown,2010,,13.2,8.7,0.51,0.25,0.67,0.09
William Avery,2010,Duke,5.7,11.3,0.72,0.33,0.72,0.4
Obi Toppin,2020,Dayton,,,,,,
Jaylen Smith,2020,Marlyand,13,,0.66,0.3,0.52,0.6
"""))

draft_data_test3 = pd.read_csv(io.StringIO(
"""id,year,rank,overall_pick,team,player,college,years_active,games,minutes_played,points,total_rebounds,assists,field_goal_percentage,3_point_percentage,free_throw_percentage,average_minutes_played,points_per_game,average_total_rebounds,average_assists,win_shares,win_shares_per_48_minutes,box_plus_minus,value_over_replacement
1,2003,1,1,CLE,LeBron James,St. Vincent-St. Mary HS,20,1400,50000,39000,10000,10000,0.504,0.344,0.737,35.7,27.9,7.1,7.4,250.4,0.220,8.9,135.6
2,2014,7,41,DEN,Nikola Jokic,Mega Basket,9,700,19000,12000,6500,4300,0.556,0.352,0.835,27.1,17.1,9.3,6.1,120.2,0.245,9.5,78.3
3,2015,13,13,PHX,Devin Booker,Kentucky,8,500,16000,9500,2200,1800,0.462,0.392,0.867,32.0,19.0,4.4,3.6,75.3,0.145,3.1,42.0"""
))
lottery_data_test = pd.read_csv(io.StringIO(
    """Unnamed: 0,pick_overall,Year,team_id,player,college_name,seasons,g,mp,pts,trb,ast,fg_pct,fg3_pct,ft_pct,mp_per_g,pts_per_g,trb_per_g,ast_per_g,ws,ws_per_48,bpm,vorp
jamesle01,1,2003,CLE,LeBron James,HS,20.0,1400.0,50000.0,39000.0,10000.0,10000.0,0.504,0.344,0.737,35.7,27.9,7.1,7.4,250.4,0.220,8.9,135.6
jokicni01,41,2014,DEN,Nikola Jokic,Mega Basket,9.0,700.0,19000.0,12000.0,6500.0,4300.0,0.556,0.352,0.835,27.1,17.1,9.3,6.1,120.2,0.245,9.5,78.3
bookede01,13,2015,PHX,Devin Booker,Kentucky,8.0,500.0,16000.0,9500.0,2200.0,1800.0,0.462,0.392,0.867,32.0,19.0,4.4,3.6,75.3,0.145,3.1,42.0"""
))

Data Merge

To address some of our research questions, we needed to work with multiple datasets. To streamline this process, we implemented a reusable function that merges any two data frames based on a specified column. This approach helps reduce code redundancy and improves readability, especially since we anticipate performing multiple merges for our analysis.

We then use the merge_data function to generate rookie_all_star and lottery_second, two data frames that will be used in two of our research questions. We chose to merge the rookie statistics and All-Star data so that we could have access to information about rookie statistics and All-Star selections in the same data frame, to answer the question "Do rookie stats predict long-term success?". Additionally, we chose to merge the draft data (which includes second-round picks) and the lottery data to select and combine the performance statistic columns in both datasets and compute the overall scores for the players to conduct further analysis.
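The choice of join type matters when merging on player names, because any name missing from one dataset (or spelled differently) silently drops out of an inner merge. The sketch below uses two toy data frames with made-up player names and values (not rows from the real datasets) to show how an inner merge differs from a left merge, and how pandas' `indicator=True` option can list the rows that found no match.

```python
import pandas as pd

# Toy frames; the names and values are illustrative only.
rookies = pd.DataFrame({"Name": ["Player A", "Player B", "Player C"],
                        "PTS": [7.4, 18.9, 4.9]})
all_stars = pd.DataFrame({"Name": ["Player A", "Player B", "Player D"],
                          "All_star_selections": [1, 4, 2]})

# An inner merge keeps only names present in both frames.
inner = rookies.merge(all_stars, on="Name", how="inner")

# A left merge keeps every rookie; indicator=True adds a "_merge" column
# recording whether each row found a match in the right-hand frame.
left = rookies.merge(all_stars, on="Name", how="left", indicator=True)
unmatched = left.loc[left["_merge"] == "left_only", "Name"].tolist()

print(len(inner))   # number of rookies who also appear in the All-Star frame
print(unmatched)    # rookies with no All-Star record
```

Checking the unmatched names before settling on a join type is one way to see how many players a merge leaves out.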

In [6]:
def merge_data(first_data, second_data, merge_column, how):
    """
    Takes in two data frames, a column name, and a join type ("inner", "left", "right",
    or "outer") and merges the two data frames on the provided column using that join
    type. The merged data frame is returned.
    """
    result = first_data.merge(second_data, on=merge_column, how=how)
    return result


rookie_all_star = merge_data(rookie_stats, all_star, "Name", "inner")
lottery_second = merge_data(draft_data, lottery, "player", "left").dropna()
display(rookie_all_star)
display(lottery_second)
index Name Year Drafted GP MIN PTS FGM FGA FG% 3P Made ... STL BLK TOV EFF position All_star_selections In_Hall_of_fame height weight born
0 0 Brandon Ingram 2016.0 36.0 27.4 7.4 2.6 7.6 34.7 0.5 ... 0.4 0.4 1.3 7.3 F 1 2 206 86 1997
1 3 Joel Embiid 2016.0 22.0 24.8 18.9 6.2 13.4 46.4 1.2 ... 0.8 2.4 3.8 18.8 C 4 2 213 113 1994
2 8 Domantas Sabonis 2016.0 34.0 21.1 6.4 2.6 6.0 43.9 1.0 ... 0.6 0.5 1.1 7.7 F 2 2 211 108 1996
3 11 Pascal Siakam 2016.0 32.0 17.8 5.1 2.3 4.3 52.5 0.0 ... 0.5 0.8 0.7 7.4 F 1 2 206 104 1994
4 23 Jaylen Brown 2016.0 34.0 13.4 4.9 1.9 4.4 44.6 0.4 ... 0.3 0.3 0.6 4.6 F 1 2 193 97 1992
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
214 1500 Bill Laimbeer 1980.0 81.0 30.4 9.8 4.2 8.3 50.3 0.0 ... 0.7 1.0 1.6 16.5 C 4 0 211 111 1957
215 1503 Kiki Vandeweghe 1980.0 51.0 27.0 11.5 4.5 10.5 42.6 0.0 ... 0.6 0.5 1.7 11.4 F 2 0 203 99 1958
216 1505 Andrew Toney 1980.0 75.0 23.6 12.9 5.3 10.7 49.5 0.1 ... 0.8 0.1 2.9 10.2 G 2 0 190 80 1957
217 1512 Kevin McHale 1980.0 82.0 20.1 10.0 4.3 8.1 53.3 0.0 ... 0.3 1.8 1.3 11.4 F 7 1 208 95 1957
218 1522 James Donaldson 1980.0 68.0 14.4 5.3 1.9 3.5 54.2 0.0 ... 0.1 1.1 1.0 8.0 C 1 0 218 124 1957

219 rows × 29 columns

id year rank overall_pick team player college years_active games minutes_played ... fg3_pct ft_pct mp_per_g pts_per_g trb_per_g ast_per_g ws ws_per_48 bpm vorp
252 325 1995 1 1 GSW Joe Smith Maryland 16.0 1030.0 27022.0 ... 0.238 0.790 26.2 10.9 6.4 1.0 60.3 0.107 -1.5 3.0
253 326 1995 2 2 LAC Antonio McDyess Alabama 15.0 1015.0 28053.0 ... 0.117 0.670 27.6 12.0 7.5 1.3 69.8 0.119 -0.1 13.2
254 327 1995 3 3 PHI Jerry Stackhouse UNC 18.0 970.0 30222.0 ... 0.309 0.822 31.2 16.9 3.2 3.3 52.4 0.083 0.3 17.4
255 328 1995 4 4 WSB Rasheed Wallace UNC 16.0 1109.0 36243.0 ... 0.336 0.721 32.7 14.4 6.7 1.8 105.1 0.139 2.2 38.4
256 330 1995 6 6 VAN Bryant Reeves Oklahoma State 6.0 395.0 12071.0 ... 0.074 0.703 30.6 12.5 6.9 1.6 13.0 0.052 -3.4 -4.3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1181 1751 2019 9 9 WAS Rui Hachimura Gonzaga 3.0 147.0 4184.0 ... 0.287 0.829 30.1 13.5 6.1 1.8 1.8 0.060 -3.2 -0.4
1182 1752 2019 10 10 ATL Cam Reddish Duke 3.0 133.0 3313.0 ... 0.332 0.802 26.7 10.5 3.7 1.5 -0.4 -0.011 -4.2 -0.9
1183 1753 2019 11 11 MIN Cameron Johnson UNC 3.0 183.0 4422.0 ... 0.390 0.807 22.0 8.8 3.3 1.2 2.7 0.102 0.3 0.7
1185 1755 2019 13 13 MIA Tyler Herro Kentucky 3.0 175.0 5294.0 ... 0.389 0.870 27.4 13.5 4.1 2.2 1.6 0.050 -1.6 0.1
1186 1756 2019 14 14 BOS Romeo Langford Indiana 3.0 98.0 1424.0 ... 0.185 0.720 11.6 2.5 1.3 0.4 0.2 0.025 -4.9 -0.3

284 rows × 46 columns

Data Filtering

For the question "Is there a difference in performance trajectory between lottery picks and second-round picks?", we created a function that removes players who played 100 or fewer games and players drafted with overall picks 15 through 30, because we only want to include lottery picks (picks 1-14) and second-round picks (pick 31 or later). We also selected only the columns needed for the analysis to make the computations less expensive.
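As a quick illustration of the filtering rule described above, the following sketch applies the same two conditions to a toy data frame. The pick numbers and game counts are invented for the example; only the thresholds (more than 100 games, picks 1-14 or 31 and later) come from the report.

```python
import pandas as pd

# Toy draft data; values are invented to exercise each boundary of the filter.
toy_draft = pd.DataFrame({
    "overall_pick": [1, 14, 15, 30, 31, 45],
    "games":        [500, 90, 300, 200, 150, 101],
})

# Flag lottery picks (1 = lottery, 0 = not lottery).
toy_draft["lottery"] = (toy_draft["overall_pick"] <= 14).astype(int)

# Keep players with more than 100 games who are lottery or second-round picks.
enough_games = toy_draft["games"] > 100
lottery_or_second = (toy_draft["overall_pick"] <= 14) | (toy_draft["overall_pick"] >= 31)
kept = toy_draft[enough_games & lottery_or_second]

print(kept["overall_pick"].tolist())
print(kept["lottery"].tolist())
```

Here pick 14 is dropped for having too few games, and picks 15 and 30 are dropped for falling between the lottery and the second round.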

In [7]:
def set_data(data, col):
    """
    Given a Dataframe and a list of column names to keep, returns a filtered Dataframe
    that contains only lottery picks (overall pick 1-14) and second-round picks (overall
    pick 31 or later) who played more than 100 games. Also adds a "lottery" indicator
    column (1 for lottery picks, 0 otherwise) to the given Dataframe.
    """
    data["lottery"] = 0
    data.loc[data["overall_pick"] <= 14, "lottery"] = 1
    data = data.loc[data["games"] > 100, :]
    criteria = (data["overall_pick"] <= 14) | (data["overall_pick"] >= 31)
    data = data.loc[criteria, :]
    # Return a copy so later column assignments do not act on a view of the original.
    return data[col + ["lottery"]].copy()

columns_needed = [
    "overall_pick",
    "player",
    "games",
    "average_minutes_played",
    "points_per_game",
    "win_shares",
    "win_shares_per_48_minutes",
    "box_plus_minus",
    "value_over_replacement"
]

lottery_second = set_data(draft_data, columns_needed)

Data Normalization

To compare performance between lottery picks and second-round picks, we first needed to ensure that all player statistics were on a consistent scale. Since performance metrics such as points per game, win shares, and minutes played are measured on very different scales, direct comparisons of raw values could be misleading.

To address this, we normalized each performance statistic by transforming it into a z-score that measures how much each player's value deviates from the mean in units of standard deviation. This process helps standardize the data, allowing us to fairly compare players across different statistical categories and generate more meaningful analysis.
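As a small worked example of the z-score transformation, the sketch below standardizes three made-up points-per-game values. It assumes the sample standard deviation (pandas' default, `ddof=1`), which is also what `describe()` reports.

```python
import pandas as pd

# Three illustrative values; the mean is 20 and the sample standard deviation is 10.
points = pd.Series([10.0, 20.0, 30.0], name="points_per_game")

# z-score: distance from the mean in units of standard deviation.
z = (points - points.mean()) / points.std()

print(z.tolist())  # [-1.0, 0.0, 1.0]
```

After the transformation, every statistic has mean 0 and standard deviation 1, so a value of 1.0 means "one standard deviation above average" regardless of the original units.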

In [8]:
def normalize(data, columns):
    """
    Given the players Dataframe and a list of statistic column names, returns the
    Dataframe with an additional "norm_" column for each statistic holding its z-score,
    so that all of the values are on the same scale.
    """
    for col in columns:
        stats = data[col]
        # Series.std() is the sample standard deviation, the same value describe() reports.
        data["norm_" + col] = (stats - stats.mean()) / stats.std()
    return data

stat_col = columns_needed[3:]
lottery_second = normalize(lottery_second, stat_col)

Testing Data Methods

In [9]:
%%ipytest
def test_load_data(): 
    result_all_star, result_rookie, result_lottery, result_draft = load_data(
        "NBA Rookies by Year.csv", "bbref-scraped-lotteryData.csv",
        "NBA Hall of Famers 2021.csv", "nbaplayersdraft.csv")
    expected_rookie = pd.read_csv("NBA Rookies by Year.csv").dropna()
    expected_lottery = pd.read_csv("bbref-scraped-lotteryData.csv")
    expected_all_star = pd.read_csv("NBA Hall of Famers 2021.csv").dropna()
    expected_draft = pd.read_csv("nbaplayersdraft.csv").dropna()

    assert_frame_equal(result_rookie, expected_rookie)
    assert_frame_equal(result_lottery, expected_lottery)
    assert_frame_equal(result_all_star, expected_all_star)
    assert_frame_equal(result_draft, expected_draft) 

def test_merge(): 
    expected_merge = nba_data_test1.merge(nba_data_test2, on="Name", how="inner")
    result = merge_data(nba_data_test1, nba_data_test2, "Name", "inner") 
    assert_frame_equal(expected_merge, result) 
    
def test_set_data():
    result = set_data(draft_data_test3, columns_needed)

    # Only players with games > 100
    filtered = result[result["games"] <= 100]
    assert filtered.shape[0] == 0, "There are players with 100 or fewer games"

    # Only lottery (<=14) or second round (>=31) picks
    filtered = result[(result["overall_pick"] > 14) & (result["overall_pick"] < 31)]
    assert filtered.shape[0] == 0, "There are players with mid first-round picks (15–30)"

    # Column names match expected
    expected_columns = columns_needed + ["lottery"]
    assert list(result.columns) == expected_columns, "Column names do not match expected"

test_data = normalize(draft_data_test3, stat_col)

def test_normalize():
    # Check if all normalized columns exist
    for col in stat_col:
        norm_col = "norm_" + col
        assert norm_col in test_data.columns, "Missing normalized column: " + norm_col

    # Check if mean of norm_points_per_game is close to 0
    mean_value = test_data["norm_points_per_game"].mean()
    assert abs(mean_value) < 0.0001, "Mean of norm_points_per_game is not close to 0"

    # Check if std of norm_points_per_game is close to 1
    std_value = test_data["norm_points_per_game"].std()
    assert abs(std_value - 1) < 0.0001, "Std of norm_points_per_game is not close to 1"
....                                                                                         [100%]
4 passed in 0.07s

Results

Do a rookie's statistics predict long-term success?

To answer this question, we first needed to define what "long-term success" means in the context of an NBA career. We chose to measure it by the number of All-Star selections a player received in their career. All-Star selections are determined by a combination of votes from NBA fans, players, and media members, who collectively choose the league’s top performers. As such, being selected as an All-Star reflects both strong statistical performance and widespread recognition, making it a reasonable representation of long-term success.

To determine whether a player's performance in their rookie year predicts the number of All-Star selections they receive, we decided to fit a linear regression model. The model predicts the number of All-Star selections from a combination of player statistics. We define the features of the model as the major statistical categories (points, rebounds, assists, steals, blocks) and the response variable as the number of All-Star selections.

In [10]:
X = rookie_all_star[["PTS", "REB", "AST", "STL", "BLK"]]
y = rookie_all_star["All_star_selections"]
reg = LinearRegression().fit(X, y)

print("Model:", " + ".join([f"{reg.intercept_:.2f}"] + 
                           [f"{coef:.2f}({X.columns[i]})" for i, coef in enumerate(reg.coef_)]))
print("Error:", mean_squared_error(y, reg.predict(X)))
Model: 0.35 + 0.16(PTS) + -0.17(REB) + 0.09(AST) + 1.54(STL) + 1.52(BLK)
Error: 10.902783482726088

Below, we create a series of regression plots to explore the linear relationship between various rookie statistics (points, assists, rebounds, steals, and blocks) and long-term success, measured by All-Star selections. This helps visually assess which performance metrics may be stronger predictors of future recognition.

In [11]:
def plot_lm(data, predictor, target):
    """
    Creates a grid of regression plots to visualize the linear relationship between multiple 
    predictor variables (player statistics) and a target variable. Returns the axes that
    were drawn on.
    """
    fig, axes = plt.subplots(figsize=(12, 6), nrows=2, ncols=3)
    axes = axes.flatten()
    colors = ["#6495ED", "#000080", "#DE3163", "#FF7F50", "#800080"]
    for ax, pred, color in zip(axes, predictor, colors):
        sns.regplot(data=data, x=pred, y=target, ax=ax, color=color)
        ax.set_title(pred + " model")
        ax.set_xlabel("Avg " + pred + " per game")
        ax.set_ylabel("All Star Selections")
    # Hide grid cells that have no predictor (the sixth panel for five predictors).
    for ax in axes[len(predictor):]:
        ax.set_visible(False)
    fig.subplots_adjust(hspace=0.6, wspace=0.4)
    return tuple(axes[:len(predictor)])

plot_lm(rookie_all_star, ["PTS", "REB", "AST", "STL", "BLK"], "All_star_selections")
Out[11]:
(<Axes: title={'center': 'PTS model'}, xlabel='Avg PTS per game', ylabel='All Star Selections'>,
 <Axes: title={'center': 'REB model'}, xlabel='Avg REB per game', ylabel='All Star Selections'>,
 <Axes: title={'center': 'AST model'}, xlabel='Avg AST per game', ylabel='All Star Selections'>,
 <Axes: title={'center': 'STL model'}, xlabel='Avg STL per game', ylabel='All Star Selections'>,
 <Axes: title={'center': 'BLK model'}, xlabel='Avg BLK per game', ylabel='All Star Selections'>)
[Figure: five regression plots of All Star Selections against average PTS, REB, AST, STL, and BLK per game]

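Results and Interpretation

The resulting linear regression model suggests a relationship between rookie statistics and the number of All-Star selections, but it is not a highly accurate relationship, as indicated by the high mean squared error of 10.90, which corresponds to a root mean squared error of about 3.3 selections. Because this error was computed on the same data used to fit the model, it is likely an optimistic estimate. In other words, our predictions often deviate considerably from the true number of All-Star selections.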
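A mean squared error is easier to interpret next to a baseline. The sketch below is not the analysis from this report: it fits the same kind of model on synthetic data (random features and an invented relationship, chosen only for illustration) and compares the model's error with that of a baseline that always predicts the mean of the response.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-ins for five rookie stats (X) and All-Star selections (y).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([0.2, -0.1, 0.1, 1.5, 1.5]) + rng.normal(scale=3.0, size=200)

reg = LinearRegression().fit(X, y)
model_mse = mean_squared_error(y, reg.predict(X))

# Baseline: always predict the mean of y, ignoring the features entirely.
baseline_mse = mean_squared_error(y, np.full_like(y, y.mean()))

rmse = np.sqrt(model_mse)                  # error in the units of y
r_squared = 1 - model_mse / baseline_mse   # share of variance explained in-sample

print(f"model MSE={model_mse:.2f}, baseline MSE={baseline_mse:.2f}")
print(f"RMSE={rmse:.2f}, R^2={r_squared:.2f}")
```

If the model's error is only slightly below the baseline's, the features add little beyond knowing the average outcome. Evaluating on held-out data rather than the training data would give a more honest estimate.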

Among the predictors, steals and blocks had the largest positive coefficients, 1.54 and 1.52, suggesting that defensive performance in a player's rookie year is more closely associated with long-term recognition and success than offensive stats like points or assists. Because steals and blocks per game span a much smaller range than points per game, coefficient sizes are not directly comparable across features, so this comparison should be read with caution. The finding is still surprising, because scoring is often considered the most visible indicator of talent in the NBA, with the highest scorers usually receiving the most recognition. One possible explanation is that standout defenders fill unique roles that are harder to replace. Most players in the NBA can score, but not everyone can block shots or get steals, which makes their impact stand out among other players.

On the other hand, rebounds had a slightly negative coefficient of -0.17, which might suggest that high rebounding alone, without the complement of other skills, doesn't strongly contribute to being recognized as a top-tier player.

Even though the model captures some trends between rookie statistics and All-star selections, the high prediction error also suggests that rookie stats alone are not sufficient to predict long-term success. Many other factors, such as injury and team context, likely play a significant role in shaping a player’s career trajectory.

Is there a difference in performance trajectory between lottery picks and second-round picks?

This question explores whether lottery picks and second-round picks in the NBA draft show different performance patterns. Lottery picks are players chosen with draft positions 1-14; second-round picks are those chosen at position 31 or later. Instead of assessing several statistics individually for each player, I developed a composite performance score. I chose variables that reflect a player's overall contribution rather than a specific strength, since players have different roles in the game. The objective of this analysis is to determine whether performance depends on draft round and to quantify how strongly initial draft position relates to a player's long-term career. Higher averages and more variability in the lottery-pick data could indicate differences in development, opportunity, or talent level between the two groups.

Rather than analyzing each statistic separately, I summed the normalized statistics into a total performance score for each player, which allows me to compare the players' overall level of ability.
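The composite score is an equally weighted sum of z-scores. As a minimal sketch, assuming a toy data frame with two normalized columns and invented values, the same sum can be written with a row-wise `sum`; `skipna=False` is used so that a missing statistic yields a missing score, matching the behavior of adding the columns one by one.

```python
import pandas as pd

# Toy normalized statistics; values are invented for illustration.
scores = pd.DataFrame({
    "norm_points_per_game": [1.0, -0.5, 0.0],
    "norm_win_shares":      [0.5, -1.0, 0.5],
})
norm_cols = ["norm_points_per_game", "norm_win_shares"]

# Row-wise sum of the normalized columns gives one composite score per player.
scores["performance"] = scores[norm_cols].sum(axis=1, skipna=False)

print(scores["performance"].tolist())  # [1.5, -1.5, 0.5]
```

One design consideration with equal weights is that closely related statistics (for example, win shares and win shares per 48 minutes) each contribute separately, so the skills they share count more than once in the total.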

In [12]:
def performance(data, columns):
    """
    Given the players Dataframe and the normalized statistic columns string list, returns a
    Dataframe with an additional column presenting the total performance score.
    """
    total = 0    
    for stat in columns:
        total += data[stat]
    data["performance"] = total
    return data


norm_cols = ["norm_" + stat for stat in stat_col]
lottery_second = performance(lottery_second, norm_cols)

Plots

The following three plots are created to evaluate the difference in performance scores with unique perspectives and approaches.

In [13]:
def lottery_second_round_performance_scatter(data):
    """
    Given the players Dataframe, returns the list of plots showing the performance score of
    lottery-pick players and second-round-pick players by overall pick position.
    """
    sub_data = data[["overall_pick", "lottery", "performance"]]
    picks = ["lottery-pick", "second-round-pick"]
    subs = [sub_data[sub_data["lottery"] == 1],
            sub_data[sub_data["lottery"] == 0]]
    colors = ["#E9C165", "#7B9DC6"]
    fig, ax = plt.subplots(figsize = (13, 6), ncols = 2, nrows = 1)
    for i in range(2):
        axis = ax[i]
        sns.regplot(subs[i], x="overall_pick", y="performance", ax=axis, color=colors[i])
        axis.set(xlabel = "overall pick",
                 ylabel = "performance",
                 title = "Performance of " + picks[i] + " players")
    return ax


lottery_second_round_performance_scatter(lottery_second)
[Figure: two side-by-side regression scatter plots, "Performance of lottery-pick players" and "Performance of second-round-pick players"; x-axis: overall pick, y-axis: performance]

The two scatter plots above show the performance scores for each group as draft position increases. The first plot, for lottery-pick players, shows a downward trend: the later a player is picked, the lower the performance score tends to be. The trend line in the second plot also has a slight negative slope, but the performance of second-round picks does not change meaningfully with draft order. Overall, both plots show high variance in scores. The lottery plot shows a predictive trend, with earlier picks having better scores, while the correlation in the second-round plot is weak.
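
The visual comparison of the two trend lines can also be expressed numerically. The sketch below fits a least-squares line and computes the Pearson correlation for each group. The values in `toy` are made up for illustration and are not the notebook's data; only the column names mirror `lottery_second`.

```python
import numpy as np
import pandas as pd

# Made-up example values, not taken from the notebook's dataset
toy = pd.DataFrame({
    "overall_pick": [1, 5, 10, 14, 31, 40, 50, 60],
    "lottery":      [1, 1, 1, 1, 0, 0, 0, 0],
    "performance":  [8.0, 5.0, 3.0, 1.0, -1.0, -1.5, -1.0, -2.0],
})

trend = {}
for flag, group in toy.groupby("lottery"):
    # Least-squares line: performance = slope * overall_pick + intercept
    slope, intercept = np.polyfit(group["overall_pick"], group["performance"], 1)
    # Pearson correlation between pick position and score
    r = group["overall_pick"].corr(group["performance"])
    trend[flag] = {"slope": slope, "r": r}
    print(flag, round(slope, 3), round(r, 3))
```

A steeper negative slope for the lottery group than for the second-round group would match the pattern described above.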

In [14]:
def lottery_second_round_performance_distribution(data):
    """
    Given the players Dataframe, returns 2 histograms showing the distribution of performance
    score counts for lottery-pick and second-round-pick players.
    """
    sub_data2 = data[["performance", "lottery"]]
    plot2 = sns.displot(sub_data2, x = "performance", col = "lottery")
    plot2.set(xlabel = "performance", ylabel = "Count")
    return plot2


lott_sec__dist = lottery_second_round_performance_distribution(lottery_second)

Above, we plotted the score distributions for the two groups of players, which provide additional insight into our research question. The narrow shape of the second-round plot suggests that performance scores are concentrated, with few players outperforming the others and most scoring below 5. The scores of lottery-pick players are more spread out, with more players scoring above 5, and some above 10, well above the average. Additionally, the scores in the lottery plot cluster around two values, roughly -2 and 2, showing greater diversity among these players.

These histograms indicate a clear difference in performance outcomes between lottery and second-round picks.

In [15]:
def lottery_second_round_performance_box(data):
    """
    Given the players Dataframe, returns 2 boxplots visually showing the summary statistics
    of performance scores for lottery-pick and second-round-pick players.
    """
    sub_data3 = data[["performance", "lottery"]]
    plot3 = sns.catplot(sub_data3, x = "performance", col = "lottery", kind = "box")
    return plot3


lott_sec_box = lottery_second_round_performance_box(lottery_second)

The boxplots illustrate the range of performance scores in the two groups. Based on the interquartile ranges, scores for second-round picks mostly fall between -5 and -1, while scores for lottery picks mostly fall between -2 and 5. The lottery group has the higher median, and the second-round group's median is below 0. In short, most lottery picks outperform the average and most second-round picks underperform it.
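
The quartiles that a boxplot draws can be computed directly, which avoids reading them off the figure. This is a small sketch on hypothetical values; the `toy` frame is made up and only reuses the `lottery` and `performance` column names.

```python
import pandas as pd

# Hypothetical scores: five lottery picks (1) and five second-round picks (0)
toy = pd.DataFrame({
    "lottery": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "performance": [-2.0, 0.0, 2.0, 5.0, 10.0, -5.0, -4.0, -3.0, -1.0, 1.0],
})

# First quartile, median, and third quartile per group
summary = toy.groupby("lottery")["performance"].quantile([0.25, 0.5, 0.75]).unstack()
# Interquartile range: width of the box in a boxplot
summary["iqr"] = summary[0.75] - summary[0.25]
print(summary)
```

Applying the same grouping to `lottery_second` would give the exact medians and interquartile ranges for the two groups.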

Testing¶

Below, I test the performance function that computes each player's performance score.

In [16]:
%%ipytest

def test_performance():
    test_data = normalize(draft_data_test3, stat_col)
    result = performance(test_data, norm_cols)

    # Check if "performance" column exists
    if "performance" not in result:
        assert False, "The 'performance' column was not added."

    # Check if performance values are sums of normalized columns
    for i in range(len(result)):
        total = 0
        for col in norm_cols:
            total = total + result[col][i]
        if abs(total - result["performance"][i]) > 0.0001:
            assert False, "Incorrect performance score at row " + str(i)

    # Check if row count is unchanged
    if len(result) != len(draft_data_test3):
        assert False, "Row count changed after performance()"
.                                                                                            [100%]
1 passed in 0.02s

Results and Interpretation¶

According to this analysis, lottery picks have higher performance scores than second-round picks, suggesting that draft position is likely one of the factors that determine long-term career performance. The lottery group generally outperformed the average and showed higher variance among players. Although the second-round group also shows a negative relationship between draft position and performance, the steeper slope and higher median for lottery picks point to possibly greater opportunity and development for players selected early. Lottery picks may get more playing time, better coaching, and more organizational investment, among other possible benefits. Though there are exceptional second-round players like Nikola Jokić, the general trend shows a higher probability of success with earlier draft positions. Draft position can thus be an important factor in an NBA player's performance trajectory.

Which college or international leagues produce the most statistically successful NBA rookies?¶

For this question, we used the Lottery dataset. We defined statistical success using six metrics: average career points per game, average career minutes per game, career field goal shooting percentage, 3-point shooting percentage, free throw shooting percentage, and career win shares per 48 minutes adjusted for career length.

We defined a player as statistically successful (high performing) if they achieved the highest value for one of the six metrics in a given year between 1995 and 2019. We then matched each such player to the college where they played basketball. We counted both the number of players who achieved a highest performance statistic and each instance of doing so. A player who led in several metrics in one year, or in the same metric across several years, therefore represents multiple instances. This approach emphasizes the variety of skills that define high performance and captures varying levels of it. Lottery picks with high performance in multiple areas or for multiple years can indicate greater career success.

First, we clean the DataFrame to keep only the columns used for this research question and remove players who are missing data needed for our visualizations. Because our dataset does not contain the necessary statistics for players in 2020, the analysis covers only players from 1995-2019.

In [17]:
def clean_lottery(lottery):
    """
    Given the lottery DataFrame, returns a cleaned lottery DataFrame for the research
    question "Which college or international leagues produce the most statistically
    successful NBA players?".
    """
    lottery = lottery.loc[:, ["player", "Year", "college_name", "pts_per_g", "mp_per_g",
                "fg_pct","fg3_pct", "ft_pct", "ws_per_48"]]
    return lottery.dropna(subset=["college_name", "mp_per_g"]).reset_index(drop=True)


lottery_clean = clean_lottery(lottery)

Next, we created a new DataFrame containing only each player's name, college, and a marker for every instance they achieved the highest performance statistic for one of the six metrics for each year during 1995-2019.

In [18]:
def high_performance_colleges_and_players(bball_df):
    """
    Given the cleaned lottery DataFrame, returns a DataFrame with colleges producing lottery picks
    with the highest performance statistics for each category and a marker for each instance of 
    achieving a highest performance statistic during 1995-2019.
    """
    column_list = ["pts_per_g", "mp_per_g", "fg_pct", "fg3_pct", "ft_pct", "ws_per_48"]
    college_list = []
    player_list = []
    for column in column_list:
        grouped_df = bball_df.dropna(subset=column_list).groupby("Year")[[column]].idxmax()
        for num in list(grouped_df.loc[:, column]):
            college_list.append(bball_df.loc[num, "college_name"])
            player_list.append(bball_df.loc[num, "player"])
    college_player_df = pd.DataFrame({"college_name": college_list, "player": player_list,
                        "count": np.ones(len(college_list))})
    return college_player_df


college_and_player_df = high_performance_colleges_and_players(lottery_clean)
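
The selection step in the function above relies on grouping by year and taking the row index of the maximum value. Here is a minimal sketch of that step for a single metric on a hypothetical four-player frame; the players and colleges are placeholders.

```python
import pandas as pd

# Hypothetical players and colleges, for illustration only
toy = pd.DataFrame({
    "player": ["P1", "P2", "P3", "P4"],
    "Year": [2000, 2000, 2001, 2001],
    "college_name": ["College X", "College Y", "College X", "College Z"],
    "pts_per_g": [12.0, 15.0, 9.0, 7.0],
})

# For each year, idxmax returns the row label of the highest pts_per_g
leader_rows = toy.groupby("Year")["pts_per_g"].idxmax()

# Look up the player and college for each yearly leader
leaders = toy.loc[leader_rows, ["Year", "player", "college_name"]]
print(leaders)
```

Repeating this for each of the six metrics and appending the results yields one row per high performance instance.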

Next, we created a table displaying each college and the number of distinct players from that college who achieved a highest performance statistic in a given year during 1995-2019.

In [19]:
def high_performing_players_table(bball_df):
    """
    Given the cleaned lottery DataFrame, returns a table displaying a college name and the number
    of players who achieved a highest performance statistic in a given year that attended that
    college during 1995-2019.
    """
    bball_df = bball_df.groupby("college_name", as_index=False)["player"].nunique()
    colors = ["rgb(215, 243, 255)"]
    table = go.Figure(data=[go.Table(header=dict(values=["College Name",
                "Number of Highest Performing Players"], line_color=["black", "black"]),
                cells=dict(values=[list(bball_df["college_name"]), list(bball_df["player"])],
                line_color=["black", "black"], fill_color=[colors,colors]))])                 
    return table


high_performing_players_table(college_and_player_df)

To visualize each college's contribution of high performing lottery picks in the NBA, we created a pie chart that displays the percentage of high performance instances achieved by players from each college out of all high performance instances. Each count represents an instance during 1995-2019 where a player achieved the highest performance statistic for that year, including multiple achievements in one year or multiple years of achievements in one of the six metrics mentioned previously. One player can represent multiple counts if they were high performing in multiple areas or years.

In [20]:
def college_high_performance_pie_chart(bball_df):
    """
    Given the cleaned lottery DataFrame, returns a pie chart that displays the percentage of high
    performance instances achieved by players of a certain college out of all high performance
    instances in a given year during 1995-2019.
    """    
    pie_chart = px.pie(bball_df, values="count", names="college_name",
        title="College Counts of Highest Performance Statistics per Year 1995-2019")
    pie_chart.update_traces(textposition='inside')
    return pie_chart


college_high_performance_pie_chart(college_and_player_df)
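
The percentages shown in the pie chart are each college's summed counts divided by the total count. The sketch below reproduces that calculation on a hypothetical frame with the same columns as `college_and_player_df`; the rows are made up.

```python
import pandas as pd

# Hypothetical instances: one row per high performance instance
toy = pd.DataFrame({
    "college_name": ["College X", "College X", "College Y", "College Z"],
    "player": ["P1", "P1", "P2", "P3"],
    "count": [1.0, 1.0, 1.0, 1.0],
})

# Total instances per college
counts = toy.groupby("college_name")["count"].sum()

# Share of all instances, in percent (what each pie slice represents)
share = counts / counts.sum() * 100
print(share)
```

Here one player contributes two instances to College X, which is why a college's share can exceed its share of distinct players.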

To highlight players who achieved the highest performance statistic in multiple categories or years, we created a bar chart displaying players who achieved the highest performance statistic in at least two different instances. We chose to highlight this because multiple instances of high performance can indicate more sustained success in a player's career.

In [21]:
def college_with_best_rookies(bball_df):
    """
    Given the cleaned lottery DataFrame, returns a bar chart displaying players who had two
    highest performance statistics for a given year or one highest performance statistic in two
    different years during 1995-2019, with each bar colored by the player's college.
    """
    bball_df = bball_df.groupby(["player", "college_name"], as_index=False)[["count"]].sum()
    bball_df = bball_df[bball_df["count"] >= 2]
    bar_chart = px.bar(bball_df, x="player", y="count", color="college_name",
            title="Players with Two or More Highest Performance Instances 1995-2019",
            labels={"player": "Player Name", "count": "High Performance Counts",
            "college_name": "College Name"})
    return bar_chart


college_with_best_rookies(college_and_player_df)

We created a bubble chart to provide a scalar comparison of each college's contribution of high performing lottery picks in the NBA. Each count represents an instance during 1995-2019 where a player achieved the highest performance statistic for that year, including multiple achievements in one year or multiple years of achievements by one player in one of the six metrics mentioned previously. Each college's bubble is sized and colored by the number of counts.

In [22]:
def college_high_performance_bubble_chart(bball_df):
    """
    Given the cleaned lottery DataFrame, returns a plotly bubble chart displaying the number of
    times any lottery pick from each college achieved a highest performance statistic in a given
    year during 1995-2019.
    """
    bball_df = bball_df.groupby("college_name", as_index=False)[["count"]].sum()
    bubble_chart = px.scatter(
        bball_df, x="college_name", y="count", size = "count", color="count",
        title="College Counts of Highest Performance Statistics per Year 1995-2019",
        labels={"college_name": "College Name", "count": "High Performance Counts"})
    return bubble_chart


college_high_performance_bubble_chart(college_and_player_df)

Testing¶

In [23]:
%%ipytest
def test_high_performance_colleges_and_players():
    test_college_and_player_df1 = high_performance_colleges_and_players(
        nba_test_lottery_performance)
    assert_frame_equal(test_college_and_player_df1, pd.DataFrame({"college_name":
        ["University of Kansas", "UC Berkeley", "Duke", "Duke", "Duke", "Duke", "Duke", "Duke",
        "Duke", "Duke", "Duke", "Duke"], "player":["Joel Embiid", "Jaylen Brown",
        "Brandon Ingram", "William Avery", "Cherokee Parks", "William Avery", "Cherokee Parks",
        "William Avery", "Brandon Ingram", "William Avery", "Cherokee Parks", "William Avery"],
        "count": np.ones(12)}))

def test_lottery_clean():
    test_lottery_clean_df = clean_lottery(nba_test_lottery_performance)
    assert_frame_equal(test_lottery_clean_df, pd.DataFrame({"player":["Brandon Ingram",
        "Joel Embiid", "Cherokee Parks", "Hasheem Thabeet", "Jaylen Brown", "William Avery"],
        "Year": [2008, 2008, 2008, 2009, 2010, 2010], "college_name": ["Duke",
        "University of Kansas", "Duke", "UConn", "UC Berkeley", "Duke"], "pts_per_g": [12.1, 15.5,
        13.0, 12.1, 13.2, 5.7], "mp_per_g": [10.0, 9.8, 7.7, 10.0, 8.7, 11.3], "fg_pct": [0.50,
        0.47, 0.66, 0.50, 0.51, 0.72], "fg3_pct": [0.24, 0.19, 0.30, np.nan, 0.25, 0.33],
        "ft_pct": [0.80, 0.65, 0.52, 0.78, 0.67, 0.72], "ws_per_48": [0.31, 0.43, 0.60, 0.32,
        0.09, 0.40]}))
..                                                                                           [100%]
2 passed in 0.02s

Results and Interpretation¶

For our research question "Which college or international leagues produce the most statistically successful NBA rookies?", we determined that Duke produced the most lottery picks and had the largest share of high performance statistics during 1995-2019. We found that Duke produced 9 players and 15 instances (10% of all instances) of the highest performing lottery picks per year in the categories of points per game, minutes per game, field goal percentage, 3-point percentage, free throw percentage, and career win shares per 48 minutes. Additionally, Duke produced the most players with multiple (2 or more) highest performance statistics in a given year.
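
As a quick check on the 10% figure, the arithmetic can be reproduced as follows. This sketch assumes that every year from 1995 to 2019 contributes exactly one leader per metric, which is what the `idxmax` step produces when each year has at least one player with complete data. The total is not taken from the notebook's output.

```python
# Six metrics, one yearly leader each (assumption stated above)
metrics = 6
years = 2019 - 1995 + 1            # 25 years, inclusive
total_instances = metrics * years  # 150 possible instances

duke_instances = 15                # figure reported in the text
share = 100 * duke_instances / total_instances
print(share)
```

Under that assumption, 15 of 150 instances corresponds to the reported 10%.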

The results were mostly consistent with our initial predictions. Duke University has a strong background in college basketball, which is largely attributed to their previous long-term basketball coach: Mike Krzyzewski. Duke University's website states that Mike Krzyzewski's record includes five college basketball national championships, six gold medals as head coach of the U.S. Men’s National Team, 28 NBA lottery picks, and 68 NBA draft selections. This record provides a consistent background behind Duke University's success with producing high performing rookies.

Beyond its former coach, Duke University also has a variety of resources that put its basketball players at an advantage. According to a study from Duke University, the average expenses per basketball player on the men's team are $\$1,329,949$. By comparison, the average expenses per basketball player on the women's team are $\$476,342.9$. Duke University's budget provides the men's basketball team with resources like new equipment that help players perform their best, making it more likely that they are drafted into the NBA.

Overall, Duke University's history and abundance of resources are likely contributors to its success in producing high performing players.

Implications and Limitations¶

One Criterion for Determining College to Attend: One limitation of our analysis is that it provides insight on only one criterion for choosing a college as a prospective student athlete. When high school students decide which college to commit to, one factor in the decision is the sport career outcomes from each school, because these can indicate a student's own career trajectory. Committing to a college for athletics involves dedicating a lot of time and effort, and understanding which colleges can maximize a student athlete's efforts furthers their career potential. However, career outcomes are not the only indicator of a student's compatibility with a school. Students should also consider factors like academics, team chemistry, coaching styles, and location. These factors can influence a player's college athletic performance, affecting their likelihood of being drafted into the NBA. College athletes who do not plan on playing professionally may choose a college based on the factors that are more relevant to their future plans. Thus, our analysis is less relevant to prospective college athletes who prioritize sport career outcomes below the other factors listed above.

Not Position Specific: Another limitation of our analysis is that it doesn't emphasize the specific statistics that matter most in projecting career outcomes from rookie performance and background. Because our analysis doesn't weight any performance statistic as more or less important, it may not provide the most accurate picture of a rookie's later career outcome. Our analysis also does not relate performance statistics to each NBA player's position. Basketball has five main positions, and players may specialize in a set of skills that makes them better at a certain position. For example, a point guard may not focus on improving or demonstrating their shooting skills because their main focus is passing. Rookies looking to maximize their career potential in a specific position should use our conclusions cautiously when making decisions. If a rookie wants to understand how performance statistics affect their career in a certain position, they can look at the linear regression visualizations for the statistics that matter for that position. If they instead treat our conclusions as a broad guide to which statistics to improve, they may be harmed because they will not focus on the skills that make them an expert at their desired position.

Basketball as a Dynamic Sport: The third limitation of our analysis is that it does not consider basketball's evolution over time. The game has changed over the past few decades. There is debate online about the level of skill required to be successful in the NBA now and how previous NBA players' skills compare to those of players today. A valid question to pose is "Is there a difference in performance statistics levels between basketball players today and basketball players from several decades ago?" It would be useful to determine whether a certain threshold for points scored, assists, rebounds, blocks, etc. was required to be successful in the NBA in previous decades and whether it remains the same today. Because our analysis does not use inference testing, researchers can use our conclusions to shape the questions behind their own inference tests on how rookie performance statistics relate to long-term career success over time. This would help aspiring professional athletes by providing statistical evidence on whether certain rookie statistics are associated with long-term career success in a modern context.